Each year, thousands of children are reported missing, an issue prevalent across the United States. Typically, the words “missing child” are associated with crimes of kidnapping and murder. However, a child can be reported missing for a variety of reasons. It is important to understand these reasons in order to better address the issue at hand. For our final project, we used data about reported missing children to create interesting visualizations in Tableau and R.
Below we display our sessionInfo().
sessionInfo(package=NULL)
## R version 3.3.3 (2017-03-06)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] backports_1.0.5 magrittr_1.5 rprojroot_1.2 tools_3.3.3
## [5] htmltools_0.3.5 yaml_2.1.14 Rcpp_0.12.10 stringi_1.1.3
## [9] rmarkdown_1.4 knitr_1.15.1 stringr_1.2.0 digest_0.6.12
## [13] evaluate_0.10
We obtained our data from data.world. Specifically from a user by the name of jamesgray, the data set missing-children-in-the-us was downloaded as our dataset. This data was extracted from The National Center for Missing and Exploited Children (NCMEC). It was released as part of the Cloudera Child Finder Hackathon to develop new methods for finding missing children. The dataset can be found here.
In order to better create and format the data regarding missing children, we built a ETL script to clean-up the data. The source code can be seen below.
source("../01 Data/R_ETL_Final.R")
## Loading required package: readr
## Loading required package: plyr
## Parsed with column specification:
## cols(
## .default = col_character(),
## childid = col_integer(),
## `height-in` = col_integer(),
## weight = col_integer(),
## ncmeccasenumber = col_integer()
## )
## See spec(...) for full column specifications.
For the ETL script, the first step involved separating the data into measures and dimensions as the clean-up varied depending on its content. Dimensiona are non-numeric values while measures are numeric values. For the dimensions, we removed any hyphens, changed the & symbol to the word “and” and replaced any NULL values with an empty string. Hyphens were removed more for aesthetics. Similarly, changing the & symbol to the word “and” was also another aesthetic item. In contrast, replacing any NULL values with an empty string allowed for better processing of the data. NULL exists for values that are unknown in the data; however, the presence of a NULL causes analysis to be more difficult to do. By removing these and simply replacing them with an empty string, the data is able to be analyzed with ease. As for the measures, we removed any non-numeric or numerically associated values. We kept numbers, the letter e for scientific notation, and any periods for decimals. The presence of non-numeric values or numerically related values will result in an error when attempting to make visualizations using the data. The measures were then set as numeric values. This file was then written to a new csv file. The ETL script code is displayed below.
require(readr)
require(plyr)
# Set the Working Directory to the 00 Doc folder
file_path = "../01 Data/PreETL_MediaReadyActiveCases.csv"
df <- readr::read_csv(file_path)
measures <- c("childid", "height-in", "weight","ncmeccasenumber")
dimensions <- setdiff(names(df), measures)
for(n in names(df)) {
df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}
if( length(dimensions) > 0) {
for(d in dimensions) {
df[d] <- data.frame(lapply(df[d], gsub, pattern="-",replacement=""))
df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
df[d] <- data.frame(lapply(df[d], gsub, pattern="NULL",replace=""))
}
}
na2zero <- function (x) {
x[is.na(x)] <- "0"
return(x)
}
if( length(measures) > 1) {
for(m in measures) {
df[m] <- data.frame(lapply(df[m], na2zero))
df[m] <- data.frame(lapply(df[m], gsub, pattern="[^.,0-9,e]",replacement= ""))
df[m] <- data.frame(lapply(df[m], function(x) as.numeric(as.character(x))))
}
}
write.csv(df, gsub("PreETL_", "", file_path), row.names=FALSE, na = "")
These graphs show some of the interesting findings found in this data.
Figure 1 and Figure 2 show a box plot of the different races with the data summarized based on weight. Each box plot includes the 5 number summary: minimum, first quartile, median, third quartile, and maximum. They are filtered by weight and missing from date.
An interesting thing to note from the box plots is clicking through the states, some states do not have reported missing children of certain race. However, the white children are almost always shown in the box plots. For many of them, the box plot for white children has the largest distribution between minimum and maximum.
Figure 3 and Figure 4 show a histogram of the number of recorded missing children in 10 year increments with regards to the missing from date.
An interesting thing to note from the histogram is that for the most part, the number of children that have gone missing has increased over the years. An abnormal point in time is in the 2000’s. The number of missing children decreases and breaks the trend. However in the 2010’s the number spikes and increases exponentially.
Figure 5 and Figure 6 show a histogram of the number of recorded missing children separated based on their height (for the Tableau visualization) and weight (for the Shiny visualization). It is further divided by race as each race is represented by a different color on the histogram as seen in the legend.
An interesting thing to note from the histogram is that similar to the box plots, the color for white children is substantially larger than those for other races. However, looking at this specific visualization, one is able to see that although white children are most likely to go missing, Hispanic children fall in close 2nd.
Figure 7 and 8 shows a scatterplot that compares missing children weight (x-axis) versus height (y-axis) colored by gender.
This plot is interesting because the of the regression line that is placed on each gender; the female regression line is significantly less steep than the male, suggesting that the average missing female is predicted to have a higher weight/height ratio than the average missing male.
FIgure 9 and 10 show a geographic fields map using Tableau. The poverty percentage is defined as the # of people below the poverty line / total # of people in the state * 100. The state ID was used as the X and Y axis for the Tableau graph, resulting in a map with dots representing the number of cases for each state. The size of the circles corresponds to the number of missing child cases, and the color of the circles corresponds to the poverty ratio. In this case, the darker (more red), the higher the percentage of poverty. Using the missing children data and inner joins the data on the State ID, a query was run from the data.world census data that outputs the state ID, state population, and state poverty amount.
Something interesting to note is the southern states tend to have higher poverty ratios than the northern states. Additionally, the greater amount of missing children occur towards the east and west coast.
Figure 11 and 12 shows a crosstab of state vs. race of missing children. It is filtered given the parameter that the income for household is less than 25K. It is then separated between Low, Medium, and High which is defined as number of household with income level from 0k-25k over the total number of households. Using the MediaReadyActiveCases and US Census data, specifically the income data, Figure 1 was produced via Tableau while Figure 2 was produced using Shiny.
An interesting thing to note is that the majority of the childrens that go missing are from the middle range income. This is apparent across the majority of the states.
Figure 5 and Figure 6 show a histogram of the number of recorded missing children separated based on their height (for the Tableau visualization) and weight (for the Shiny visualization). It is further divided by race as each race is represented by a different color on the histogram as seen in the legend.
An interesting thing to note from the histogram is that similar to the box plots, the color for white children is substantially larger than those for other races. However, looking at this specific visualization, one is able to see that although white children are most likely to go missing, Hispanic children fall in close 2nd.
Figure 15 and 16 shows a barchart of the race of missing children vs. the count of children missing, specifically in each state. The black line shows the average number of children missing in each state across all races. Additionally, there is a table calculation within this barchart The Average Count of Children per Race - count of Children per Race is calculated and indicated on the barchart.
An interesting thing to note for this barchart is the fact that white children are more likely to go missing in each state. Additionally, typically Asian childrens are one of the least likely to go missing in each state. Figure 1 was created in Tableau while Figure 2 was created using Shiny.
Figure 17 and 18 show a geographic fields map based on longitude and latitude. The color on the map differentiates between whether females or males are the sex of the majority of the children that have gone missing. A set was made from a barchart of missingfromstate vs. the sum of people with bachelor’s degrees in each state. This data was created by joining the MediaReadyActiveCases with the Education data from the US Census Data. This set consists of the “Medium” group with bachelor’s degrees. This set was then plotted on the map. For the two figures, the type of case was also included in a popup upon hovering over the location.
Something to notice is that the majority of this set comes from the east side of the United States. Additionally, these cases are typically endangered runaway cases. Figure 3 was created via Tableau while Figure 4 was created using Shiny.
Figure 19 and Figure 20 show the number of children that have gone missing versus the year upon which they are reported missing.
An interesting thing to note is that in the last two years, specifically 2016 and 2017, the number of children reported missing has increased substantially. It was not until the year of 2010, that the number of missing children reported began to exceed 100 cases. In less than 10 years, the number has shot up exponentially to over 900 cases. FIgure 5 was created using Tableau while Figure 6 was created using Shiny.
Figures 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20 has been published to the following shiny.ios app: https://carolhuang0502.shinyapps.io/finalproject/
source("../02 Shiny/server.R")
## Loading required package: ggplot2
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
## Loading required package: shiny
## Loading required package: shinydashboard
##
## Attaching package: 'shinydashboard'
## The following object is masked from 'package:graphics':
##
## box
## Loading required package: data.world
##
## Attaching package: 'data.world'
## The following object is masked from 'package:dplyr':
##
## query
## Loading required package: DT
##
## Attaching package: 'DT'
## The following objects are masked from 'package:shiny':
##
## dataTableOutput, renderDataTable
## Loading required package: leaflet
## Loading required package: plotly
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following objects are masked from 'package:plyr':
##
## arrange, mutate, rename, summarise
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
source("../02 Shiny/ui.R")
# server.R
require(ggplot2)
require(dplyr)
require(shiny)
require(shinydashboard)
require(data.world)
require(readr)
require(DT)
require(leaflet)
require(plotly)
shinyServer(function(input, output) {
# These widgets are for the Barcharts tab.
online2 = reactive({input$rb2})
output$races2 <- renderUI({selectInput("selectedraces", "Choose Races:", race_list, multiple = TRUE, selected='All') })
# The following query is for the select list in the Barcharts -> Barchart with Table Calculation tab.
races = query(
data.world(propsfile = "www/.data.world"),
dataset="carolhuang0502/s-17-dv-project-6", type="sql",
query="select distinct race as D, race as R
from MediaReadyActiveCases
order by 1"
) # %>% View()
if(races[1] == "Server error") {
print("Getting races from csv")
file_path = "../01 Data/MediaReadyActiveCases.csv"
df <- readr::read_csv(file_path)
tdf1 = df %>% dplyr::distinct(race) %>% arrange(race) %>% dplyr::rename(D = race)
tdf2 = df %>% dplyr::distinct(race) %>% arrange(race) %>% dplyr::rename(R = race)
races = bind_cols(tdf1, tdf2)
}
race_list <- as.list(races$D, races$R)
race_list <- append(list("All" = "All"), race_list)
# The following query is for the select list in the Barcharts -> Medium Bachelors Degree Level
degrees = query(
data.world(propsfile = "www/.data.world"),
dataset="carolhuang0502/s-17-dv-project-6", type="sql",
query="SELECT sex, casetype, missingfromcity, educationqueryresult.State, bachelors_degree, state_lat_long.State as state, Latitude, Longitude, sum(bachelors_degree) as sum_bac
FROM MediaReadyActiveCases s inner join
educationqueryresult a
ON (s.missingfromstate = a.State) inner join
state_lat_long c
ON (s.missingfromstate = c.State)
where c.State in (a.State)
group by missingfromstate
having sum(bachelors_degree)> 32000000 and sum(bachelors_degree)< 490000000"
) # %>% View()
# The following query is for hte select list in the Barcharts -> Missing By Year
sales = query(
data.world(propsfile = "www/.data.world"),
dataset="carolhuang0502/s-17-dv-project-6", type="sql",
query="select CAST (year(CAST (missingreporteddate AS date)) as string) as year, count(childid) as record
from MediaReadyActiveCases
group by CAST (year(CAST (missingreporteddate AS date)) as string)
order by year"
) # %>% View()
# Begin Box Plot Tab ------------------------------------------------------------------
dfbp1 <-
query(
data.world(propsfile = "www/.data.world"),
dataset="carolhuang0502/s-17-dv-final-project", type="sql",
query="select weight,race,missingfromdate
from MediaReadyActiveCases
where weight<400") # %>% View()
output$boxplotData1 <- renderDataTable({DT::datatable(dfbp1, rownames = FALSE,
extensions = list(Responsive = TRUE,
FixedHeader = TRUE)
)
})
output$boxplotPlot1 <- renderPlotly({
means <- aggregate(weight ~ race,dfbp1, mean)
rounded_means <- round(means$weight,1)
means$weight <- rounded_means
p <- ggplot(dfbp1, aes(x=race,y=weight)) +
geom_boxplot() +
stat_summary(fun.y = "mean", colour="darkred", geom="point", shape=18, size=2,show_guide = FALSE) +
geom_text(data = means, aes(label = weight, y = weight+10))
ggplotly(p)
})
# End Box Plot Tab ___________________________________________________________
# Begin Histogram Tab ------------------------------------------------------------------
dfh1 <- query(
data.world(propsfile = "www/.data.world"),
dataset="carolhuang0502/s-17-dv-final-project", type="sql",
query="select year
from Year_query") # %>% View()
output$histogramData1 <- renderDataTable({DT::datatable(dfh1,
rownames = FALSE,
extensions = list(Responsive = TRUE, FixedHeader = TRUE) )
})
output$histogramPlot1 <- renderPlot({ggplot(dfh1) +
geom_histogram(aes(x=dfh1), binwidth = 10, fill = I("blue")) + labs(x = "Decade", y = "Count") +
theme(axis.text.x=element_text(angle=90, size=10, vjust=0.5))
})
dfh2 <- query(
data.world(propsfile = "www/.data.world"),
dataset="carolhuang0502/s-17-dv-final-project", type="sql",
query="select weight, race
from MediaReadyActiveCases
where weight < 500") # %>% View()
output$histogramData2 <- renderDataTable({DT::datatable(dfh2,
rownames = FALSE,
extensions = list(Responsive = TRUE, FixedHeader = TRUE) )
})
output$histogramPlot2 <- renderPlot({ggplot(dfh2) + geom_histogram(aes(x=weight, fill=race, color="black"), binwidth = 10) + labs(x = "Weight (lbs)", y = "Count")+
theme(axis.text.x=element_text(angle=90, size=10, vjust=0.5))
})
# End Histogram Tab ___________________________________________________________
# Begin Barchart Tab ------------------------------------------------------------------
df2 <- eventReactive(input$click2, {
if(input$selectedraces == 'All') race_list <- input$selectedraces
else race_list <- append(list("Skip" = "Skip"), input$selectedraces)
if(online2() == "SQL") {
print("Getting from data.world")
tdf = query(
data.world(propsfile = "www/.data.world"),
dataset="carolhuang0502/s-17-dv-project-6", type="sql",
query="select missingfromstate, race, count(childid) count_childid
from MediaReadyActiveCases
where ? = 'All' or race in (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
group by missingfromstate, race",
queryParameters = race_list
) # %>% View()
}
else {
print("Getting from csv")
file_path = "../01 Data/MediaReadyActiveCases.csv"
df <- readr::read_csv(file_path)
tdf = df %>% dplyr::filter(race %in% input$selectedraces | input$selectedraces == "All") %>%
dplyr::group_by(missingfromstate, race) %>%
dplyr::summarize(count_childid = count(childid)) # %>% View()
}
# The following two lines mimic what can be done with Analytic SQL. Analytic SQL does not currently work in data.world.
tdf2 = tdf %>% group_by(missingfromstate) %>% summarize(window_count_childid = mean(count_childid))
dplyr::inner_join(tdf, tdf2, by = "missingfromstate")
# Analytic SQL would look something like this:
# select missingfromstate, race, count_childid, avg(count_childid)
# OVER (PARTITION BY missingfromstate ) as window_avg_childid
# from (select missingfromstate, race, count(childid) count_childid
# from MediaReadyActiveCases
# group by missingfromstate, race)
})
output$barchartData1 <- renderDataTable({DT::datatable(df2(),
rownames = FALSE,
extensions = list(Responsive = TRUE, FixedHeader = TRUE) )
})
output$barchartData2 <- renderDataTable({DT::datatable(degrees,
rownames = FALSE,
extensions = list(Responsive = TRUE, FixedHeader = TRUE) )
})
output$barchartData3 <- renderDataTable({DT::datatable(sales,
rownames = FALSE,
extensions = list(Responsive = TRUE, FixedHeader = TRUE) )
})
output$barchartPlot1 <- renderPlot({ggplot(df2(), aes(x=race, y=count_childid)) +
scale_y_continuous(labels = scales::comma) + # no scientific notation
theme(axis.text.x=element_text(angle=0, size=12, vjust=0.5)) +
theme(axis.text.y=element_text(size=12, hjust=0.5)) +
geom_bar(stat = "identity") +
facet_wrap(~missingfromstate, ncol=1) +
coord_flip() +
# Add count_childid, and (count_childid - window_avg_childid) label.
geom_text(mapping=aes(x=race, y=count_childid, label=round(count_childid, digits = 1)),colour="black", hjust=-.5) +
geom_text(mapping=aes(x=race, y=count_childid, label=round(window_count_childid - count_childid, digits = 1)),colour="blue", hjust=-2) +
# Add reference line with a label.
geom_hline(aes(yintercept = round(window_count_childid)), color="red") +
geom_text(aes( -1, window_count_childid, label = window_count_childid, vjust = -.5, hjust = -.25), color="red")
})
output$barchartMap1 <- renderLeaflet({leaflet(width = 400, height = 800) %>%
setView(lng = -98.35, lat = 39.5, zoom = 4) %>%
addTiles() %>%
addMarkers(lng = degrees$Longitude,
lat = degrees$Latitude,
options = markerOptions(draggable = TRUE, riseOnHover = TRUE),
popup = as.character(paste("City: " , degrees$missingfromcity,
", State: ", degrees$missingfromcity,
", Case Type: ", degrees$casetype,
", Sex: ", degrees$sex
)) )
})
output$barchartPlot2 <- renderPlotly({p <- ggplot(sales, aes(x=as.character(year), y=record)) +
theme(axis.text.x=element_text(angle=0, size=7, vjust=0.5)) +
theme(axis.text.y=element_text(size=12, hjust=0.5)) +
geom_bar(stat = "identity")+
xlab("\nYear")+ylab("\nRecord")
ggplotly(p)
})
# End Barchart Tab ___________________________________________________________
})
#ui.R
require(shiny)
require(shinydashboard)
require(DT)
require(leaflet)
require(plotly)
dashboardPage(
dashboardHeader(
),
dashboardSidebar(
sidebarMenu(
menuItem("Box Plots", tabName = "boxplot", icon = icon("dashboard")),
menuItem("Histograms", tabName = "histogram", icon = icon("dashboard")),
menuItem("Scatter Plots", tabName = "scatter", icon = icon("dashboard")),
menuItem("Crosstabs, KPIs, Parameters", tabName = "crosstab", icon = icon("dashboard")),
menuItem("Barcharts, Table Calculations", tabName = "barchart", icon = icon("dashboard"))
)
),
dashboardBody(
tags$style(type="text/css",
".shiny-output-error { visibility: hidden; }",
".shiny-output-error:before { visibility: hidden; }"
),
tabItems(
# Begin Box Plots tab content.
tabItem(tabName = "boxplot",
tabsetPanel(
tabPanel("Data",
DT::dataTableOutput("boxplotData1")
),
tabPanel("Simple Box Plot",
plotlyOutput("boxplotPlot1", height=500))
)
),
# End Box Plots tab content.
# Begin Histogram tab content.
tabItem(tabName = "histogram",
tabsetPanel(
tabPanel("Data",
DT::dataTableOutput("histogramData1"),
hr(),
DT::dataTableOutput("histogramData2")
),
tabPanel("Histogram 1", plotOutput("histogramPlot1", height=900)),
tabPanel("Histogram 2 ", plotOutput("histogramPlot2", height=900))
)
),
# End Histograms tab content.
# Begin Scatter Plots tab content.
tabItem(tabName = "scatter",
tabsetPanel(
tabPanel("Data",
radioButtons("rb3", "Get data from:",
c("SQL" = "SQL",
"CSV" = "CSV"), inline=T),
uiOutput("scatterStates"), # See http://shiny.rstudio.com/gallery/dynamic-ui.html,
actionButton(inputId = "click3", label = "To get data, click here"),
hr(), # Add space after button.
DT::dataTableOutput("scatterData1")
),
tabPanel("Simple Scatter Plot", plotlyOutput("scatterPlot1", height=1000))
)
),
# End Scatter Plots tab content.
# Begin Crosstab tab content.
tabItem(tabName = "crosstab",
tabsetPanel(
tabPanel("Data",
radioButtons("rb1", "Get data from:",
c("SQL" = "SQL",
"CSV" = "CSV"), inline=T),
sliderInput("KPI1", "KPI_Low:",
min = 0, max = .1, value = .1),
sliderInput("KPI2", "KPI_Medium:",
min = .1, max = .2, value = .2),
actionButton(inputId = "click1", label = "To get data, click here"),
hr(), # Add space after button.
DT::dataTableOutput("data1")
),
tabPanel("Crosstab", plotOutput("plot1", height=1000))
)
),
# End Crosstab tab content.
# Begin Barchart tab content.
tabItem(tabName = "barchart",
tabsetPanel(
tabPanel("Data",
radioButtons("rb2", "Get data from:",
c("SQL" = "SQL",
"CSV" = "CSV"), inline=T),
uiOutput("races2"), # See http://shiny.rstudio.com/gallery/dynamic-ui.html
actionButton(inputId = "click2", label = "To get data, click here"),
hr(), # Add space after button.
'Here is data for the "Barchart with Table Calculation" tab',
hr(),
DT::dataTableOutput("barchartData1"),
hr(),
'Here is data for the "Medium Bachelors Degree Level" tab',
hr(),
DT::dataTableOutput("barchartData2"),
hr(),
'Here is data for the "Missing by Year" tab',
hr(),
DT::dataTableOutput("barchartData3")
),
tabPanel("Race of Missing Children per State with Table Calculation", "Black = Count of Children per Race, Red = Average Count of Children per missingfromstate, and Blue = (Average Count of Children per Race - Count of Children per Race)", plotOutput("barchartPlot1", height=6000)),
tabPanel("Medium Bachelors Degree Level", leafletOutput("barchartMap1"), height=900 ),
tabPanel("Missing by Year", plotlyOutput("barchartPlot2",width=1300,height=800) )
)
)
# End Barchart tab content.
)
)
)
library(knitr) f = system.file